E02 - Intro to R (Analysis of Gene Expression @ UCT Prague)


Please copy these files and directories to your personal directory:

cp -r ~/shared/AGE_current/Exercises/E02-intro_to_R ~/AGE/Exercises

Assignment

The assignment for this exercise is located in E02-intro_to_R/assignment.


RMarkdown

First you will learn how to use RMarkdown. Please, open intro_to_Rmd.Rmd in RStudio. You can also open its rendered version intro_to_Rmd.html in your web browser.


The must read sections

Take this material as an overview of good R techniques and packages. It is quite long, but some sections are really necessary to go through, as we will utilize them in the following exercises and assignments:

Don’t just be an observer: try to modify the examples, so you will understand how things work! 😉


Reproducible R

Well, this could be a chapter on its own, but only a few important tips will be given, basically to avoid things like these:

  • setwd("C:\Users\jenny\path\that\only\I\have") will obviously work only on its creator’s computer.
  • rm(list = ls()) will not perform a full restart of R; it will just remove user-defined objects. In addition, what if I accidentally run your script and it removes a week of hard computation on my computer?

Almost all aspects of using R for reproducible research are summarized in these slides.

Project-oriented workflow

As you know, we have created the RStudio project for AGE exercises, with its top-level directory being ~/AGE/Exercises (you can find the Exercises.Rproj file there). Whenever you open a project in RStudio, it will automatically set the working directory to its top-level.

Now that you have this information, you might think: “I will stop using setwd() and always use paths relative to the top-level directory”. Unfortunately, some packages are evil and set your working directory silently, which leads to unexpected and hard-to-debug errors (one example is knitr, described in intro_to_Rmd.Rmd). Luckily, the here package was created for consistent navigation in your projects.

Let’s see how it works. The main function is here::here():

library(here)
here()
## [1] "/data/persistent/jirinovo/AGE/Exercises"
here("E02-intro_to_R", "images")
## [1] "/data/persistent/jirinovo/AGE/Exercises/E02-intro_to_R/images"

It joins paths relative (E02-intro_to_R/images) to the project’s top-level directory with its absolute path (/data/persistent/jirinovo/AGE/Exercises). But what happens if we change the working directory?

# Just for educational purposes!
setwd("E01-intro")
cat("Current working directory:\n")
getwd()
## Current working directory:
## [1] "/data/persistent/jirinovo/AGE/Exercises/E01-intro"
cat("here():\n")
here()
## here():
## [1] "/data/persistent/jirinovo/AGE/Exercises"
cat("Some path:\n")
here("E02-intro_to_R", "images")
## Some path:
## [1] "/data/persistent/jirinovo/AGE/Exercises/E02-intro_to_R/images"
setwd(here())

Hmm, nothing happened after setwd(), right? This way, no hidden setwd() can surprise you 😎

How is it possible that here always knows where the top-level of the project is located? Well, here is smart: it scans your project directory structure and does some heuristics. The most basic one is searching for a file ending with .here or .Rproj. Wait! Is that one of the reasons why we have created our RStudio project? Yes, exactly 🤓 However, for those who are not using RStudio (or its projects), this repository contains a .here file in the root directory (Exercises/).

Besides .here and .Rproj files, here automatically recognizes some other files, see ?here::here().

Also, the root directory can be set manually with here::i_am("<path to file>"), e.g. we can call here::i_am("path/to/age_library_empty.R") (this file is located in the Exercises/ directory). Based on the relative path this file is located in, here uses Exercises/ as the root directory.

You can inspect the reason why here is using a particular directory as the root one:

here::dr_here()

If you want to save typing, you can also define variables holding here paths and use them to construct paths relative to them:

output_dir <- here("some/output/directory")
here(output_dir, "analysis_1")
## [1] "/data/persistent/jirinovo/AGE/Exercises/some/output/directory/analysis_1"

You can find more information about project-oriented workflow here and about safe here paths here and here.

Working directory in RMarkdown code chunks

By default, RMarkdown code chunks in RStudio are evaluated in the directory of the opened Rmd file. That violates the project-oriented workflow described above, but it can be changed in RStudio options:

RMarkdown code chunks should be evaluated in the project’s root directory.

The same applies to rmarkdown::render() (and to knitr), but it can be changed (before you call render()) with:

knitr::opts_knit$set(root.dir = here::here())

Or directly in rmarkdown::render(knit_root_dir = here::here()).

Put runtime information to the end of your script/Rmd file

It is a good practice to do so for the sake of reproducibility. This code outputs all important information about your R session (see the Cleanup section for actual rendered output):

  • Warnings.
  • Traceback of errors.
  • Loaded packages and their versions, external libraries (BLAS etc.).
warnings()
traceback()
sessionInfo()

Instead of sessionInfo() you can utilize the sessioninfo package, which offers a prettier output:

sessioninfo::session_info()

Do not automatically reload your R session

Otherwise you can expect the unexpected.

Because you are using R through RStudio Server, don’t worry when you close your browser - the R session will keep running.

Prevent package namespace conflicts

When you load a package with library(), all exported objects from that package are attached to the search path, masking existing objects of the same name. Let’s see an example (restart your R session first). A function named select() is very common; for example, it is present in these two packages:

AnnotationDbi::select
## standardGeneric for "select" defined from package "AnnotationDbi"
## 
## function (x, keys, columns, keytype, ...) 
## standardGeneric("select")
## <bytecode: 0x55ff7fb07d40>
## <environment: 0x55ff7fb25530>
## Methods may be defined for arguments: x
## Use  showMethods(select)  for currently available ones.
dplyr::select
## function (.data, ...) 
## {
##     UseMethod("select")
## }
## <bytecode: 0x55ff8468e018>
## <environment: namespace:dplyr>

If we use the package prefix to qualify the select() function (AnnotationDbi:: or dplyr::), no problems can arise.

But see what happens when we load these two packages:

library(AnnotationDbi)
library(dplyr)
select
function (.data, ...) 
{
    UseMethod("select")
}
<bytecode: 0x5558f010cba8>
<environment: namespace:dplyr>

You see that select() from dplyr won, because this package was loaded after AnnotationDbi. This can lead to unexpected results: you think you are using a function from one package, but you are actually using a function with the same name from another package, which was loaded last.

To prevent such confusion, we can use the conflicted package. It is a guardian and dictator in one person: it throws an error when you attempt to use a function with an ambiguous name, and it also allows you to declare a preferred package for such conflicts. Let’s see it in action (restart your R session again):

library(conflicted)
library(AnnotationDbi)
library(dplyr)
select()
Error: [conflicted] `select` found in 2 packages.
Either pick the one you want with `::` 
* dplyr::select
* AnnotationDbi::select
Or declare a preference with `conflict_prefer()`
* conflict_prefer("select", "dplyr")
* conflict_prefer("select", "AnnotationDbi")
Run `rlang::last_error()` to see where the error occurred.

You see that select is conflicting and conflicted offers you two options for resolving it:

  • Use the package prefix:
dplyr::select
  • Or use conflict_prefer() to select a preferred package:
conflict_prefer("select", "dplyr")
select
[conflicted] Will prefer dplyr::select over any other package

function (.data, ...) 
{
    UseMethod("select")
}
<bytecode: 0x56350f4713e8>
<environment: namespace:dplyr>

conflicted should be loaded as the first package.

Unfortunately, I encountered some RStudio freezes while using conflicted. In that case, refer to the section “What to do when RStudio is stuck” in E01-intro.

Alternatively, the following packages offer a more explicit mechanism of package loading. They allow you to import only specific objects from a package, similar to Python’s from package import function as fn.
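As an illustration of this style (using the import package as one example of such a mechanism; whether you use this particular package is up to you):

```r
# Attach only the named objects instead of the whole package namespace
# (assumes the 'import' and 'dplyr' packages are installed):
import::from(dplyr, select, filter)

# Now select() and filter() unambiguously refer to dplyr's versions,
# while the rest of dplyr stays unattached.
```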


Use renv to capture your project dependencies

This is really not needed for AGE, but it is anyway good practice to somehow export all packages needed to run your project, and renv is a great tool for that. It basically does two things:

  • Creates a separate R library for your project. This way, all packages are installed into this library instead of the system one. So when some project needs a particular package version, you won’t mess up your system library.
  • It searches your project files for library() and package:: usages and exports the packages with exact versions to a file called renv.lock. Based on this file, everyone is able to install all packages needed for the project.
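A minimal sketch of the typical renv workflow (assuming renv is installed; these commands modify your project, so treat them as an illustration rather than something to paste blindly):

```r
renv::init()      # set up a project-local library and an initial renv.lock
# ... install packages, work on the project ...
renv::snapshot()  # record the exact package versions into renv.lock
renv::restore()   # (e.g. on another machine) reinstall packages from renv.lock
```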

renv is somewhat similar to virtual environments in conda 🙂


Use targets: a Make-like pipeline toolkit for R

The targets package is a Make-like pipeline toolkit for statistics and data science in R. With targets, you can maintain a reproducible workflow without repeating yourself. targets skips costly runtime for tasks that are already up to date, runs the necessary computation with implicit parallel computing, and abstracts files as R objects. A fully up-to-date targets pipeline is tangible evidence that the output aligns with the code and data, which substantiates trust in the results.

This is really out of scope of AGE, but targets is the way reproducible R analyses should be done.


Installing R packages

For bioinformatic purposes we will be using packages from Bioconductor, which groups together packages for bioinformatics. There are several core packages developed by Bioconductor’s team which share the same philosophy (API) and are used by other third-party packages. For example, SummarizedExperiment is used for storing experiment data (expression matrix, phenotype data, genomic ranges).
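As a sketch of what such an object looks like (assuming the SummarizedExperiment package is installed; the gene/sample names and values are made up):

```r
suppressPackageStartupMessages(library(SummarizedExperiment))

# A tiny expression matrix: genes in rows, samples in columns.
counts <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("gene_A", "gene_B"), c("sample_1", "sample_2", "sample_3"))
)

# Bundle the matrix with per-sample phenotype data in one object.
se <- SummarizedExperiment(
  assays = list(counts = counts),
  colData = DataFrame(Treatment = c("yes", "no", "yes"))
)

# assay(se) returns the expression matrix, colData(se) the phenotype data.
```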

Bioconductor packages are installed through the BiocManager::install() function. However, because some Bioconductor packages can interfere with packages from R’s main repository, CRAN, always use BiocManager::install() to install packages, regardless of their source repository.

It is also possible to install packages from a GitHub repository with BiocManager::install("user/repository"), for example BiocManager::install("tidyverse/stringr"). BiocManager recognizes / as a GitHub source and uses devtools::install_github() underneath.

If you run BiocManager::install(), it will detect outdated packages and ask you whether to update them. However, some of them are system-wide and must be installed from R running with root privileges, see below.

If you install a package from an R session launched from a user shell (that is, the R session in RStudio), it will be installed into your personal library in your home directory. However, some packages must be installed into the system library (usually the base R packages like Matrix), so you need to run R as the root user from a terminal (sudo R) and then install/update the packages.


Debugging R code

Because R has dynamic typing and its error tracing is not as good as in e.g. Python, debugging can be very painful. Here are some tips for saving your mental health (or refer to this):

traceback()

When an error occurs, its message is usually not very informative. The traceback() function prints the call stack of the last error, helping you locate the error.

Let’s define a group of functions that call each other, starting from f():

f <- function(x) g(x)
g <- function(y) h(y)
h <- function(z) {
  z**2
}

Now we can try calling f(). It will fail for "2", obviously:

f(2)
## [1] 4
f("2")
## Error in z^2: non-numeric argument to binary operator

We can inspect the call stack:

traceback()
## 3: h(y) at <text>#2
## 2: g(x) at <text>#1
## 1: f("2")

Also try running f("2") at your prompt. RStudio will directly offer to show the traceback or to rerun the code with debugging (see below).

Interactive debugger

Sometimes you need more information from the location of an error. The interactive debugger allows you to pause the execution of a function and interactively explore its state. It is invoked by calling the browser() function:

f2 <- function(x) g2(x)
g2 <- function(y) h2(y)
h2 <- function(z) {
  browser()
  z**2
}

f2("2")

The interactive debugger can also be invoked by RStudio, as shown above in the traceback() section.

Unfortunately, browser() doesn’t work well from within code chunks, so rather run the debugged function from the R command line (interactive session).

recover()

Another way to activate browser() is to use options(error = recover). Now when you get an error, you will get an interactive prompt that displays the traceback and gives you the ability to interactively debug inside any of the frames:

options(error = recover)
f("2")

To restore the default on-error behaviour, set the option back to NULL:

options(error = NULL)

Writing your own functions

Functions are an essential part of every programming language and you should already know how to write your own, so this will be just a quick overview.

A rather simplistic view defines a function as a repeatable, parametric block of code which returns something. Let’s define the most basic function, which has no arguments and doesn’t return anything:

fn <- function() {
  message("Hello world!")
}
fn()
## Hello world!

Let’s add a return value:

fn2 <- function() {
  return("Hello world!")
}
out <- fn2()
out
## [1] "Hello world!"

A function returns the value of its last expression, so this is equivalent:

fn2 <- function() {
  "Hello world!"
}

However, the difference is that return() will immediately exit the function.
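A small sketch of that early-exit behaviour (safe_sqrt() is just a made-up name for this illustration):

```r
safe_sqrt <- function(x) {
  if (x < 0) {
    return(NA)  # exits the function immediately; sqrt() below is never reached
  }
  sqrt(x)
}

safe_sqrt(-1)
## [1] NA
safe_sqrt(4)
## [1] 2
```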

Now we add some arguments:

fn3 <- function(x, y = 2) {
  return(x * y)
}
fn3(2)
## [1] 4
fn3(2, y = 5)
## [1] 10

Functions can have required arguments (in our case x) and optional arguments with default values (y = 2).
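Note that calling fn3() without the required argument throws an error, and due to R’s lazy evaluation the error is raised only when x is first used inside the function:

```r
# fn3 as defined above:
fn3 <- function(x, y = 2) {
  return(x * y)
}

# Catch and show the error message raised when x is evaluated:
tryCatch(fn3(), error = function(e) conditionMessage(e))
## [1] "argument \"x\" is missing, with no default"
```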

If you want to return multiple values, you can use list() or similar data type:

fn4 <- function(x, y = 2) {
  out <- list(x * y, x + y)
  return(out)
}
fn4(3)
## [[1]]
## [1] 6
## 
## [[2]]
## [1] 5

An important concept is function scope, which defines which variables a function can access. R uses lexical scoping: a function can see variables in the environment in which it was defined. In this example, fn5() can see y defined in the global environment:

fn5 <- function(x) {
  return(x * y)
}
y <- 3
fn5(2)
## [1] 6

However, functions cannot modify variables outside of their scope with the normal assignment operator. Let’s try to modify y within fn6():

fn6 <- function(x) {
  y <- 5
  return(x * y)
}
y <- 3
fn6(2)
## [1] 10
y
## [1] 3

You can see that y <- 5 only created a local variable y - the y from the outer scope wasn’t modified.
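For completeness, a function can modify a variable in an enclosing scope with the superassignment operator <<-, but this is exactly the kind of side effect to avoid (fn7 is a made-up name, shown only for illustration):

```r
y <- 3
fn7 <- function() {
  y <<- 10  # <<- assigns into the enclosing (here: global) environment
}
fn7()
y
## [1] 10
```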

As you will see in the next section, functions can also be anonymous, i.e. not assigned to any variable and used just once, like this:

function(x) {
  x ** 2
}

Try to utilize vectorized operations

R’s fundamental feature is vectorization. All basic data types (numeric, character, etc.) are vectors, and most base functions are vectorized, including the basic arithmetic operators:

(my_vector <- 1:5)
## [1] 1 2 3 4 5
my_vector + 5
## [1]  6  7  8  9 10
my_vector * 2
## [1]  2  4  6  8 10

So for example, for-loops like this are unnecessary in R:

my_vector <- 1:2e6

system.time({
  results <- list()
  for (i in my_vector) {
    results[[i]] <- my_vector[i] + 1
  }
})
##    user  system elapsed 
##   8.565   0.296   8.876

Compared to that, lapply() is much faster:

rm(results)
system.time({
  results <- lapply(my_vector, function(x) x + 1)
})
##    user  system elapsed 
##   1.996   0.067   2.066

Anyway, in this sort of toy example, you should use the vectorized addition operation instead:

system.time(my_vector + 1)
##    user  system elapsed 
##   0.000   0.015   0.015

Generally, try to avoid for-loops, because:

For-loops have side effects

  • They leave the iteration variable in the environment. From the example above, the variable i still has the value of the last element of my_vector: 2000000
  • They can change other variables from within the loop, which can lead to hard-to-find bugs.

For-loops could be slow

These are mostly cases where each iteration adds something new to a vector or matrix, which then needs to be resized, causing a re-allocation of memory. Two great posts on this topic are here and here (chapter 4).

Taking care to pre-allocate the output list, we can rewrite the for-loop above as:

system.time({
  results <- vector(mode = "list", length = length(my_vector))
  for (i in my_vector) {
    results[[i]] <- my_vector[i] + 1
  }
})
##    user  system elapsed 
##   3.999   0.032   4.036

Now it is much faster than the naive for-loop above (though, as the timings show, still a bit slower than lapply()). But as you can see, you have to take care of the pre-allocation yourself, which is something the apply-family functions do for you (see below).

For-loops cannot be easily parallelized

There are several packages which offer easy parallelization of apply-family functions (see below).

So, what to use in R instead of for-loops?

Vectorized functions

Some examples of vectorized functions other than arithmetic ones:

my_vector <- 1:5
mean(my_vector)
## [1] 3
sum(my_vector)
## [1] 15
cumsum(my_vector)
## [1]  1  3  6 10 15
ifelse(my_vector < 2, "< 2", ">= 2")
## [1] "< 2"  ">= 2" ">= 2" ">= 2" ">= 2"

Some functions operating on matrix row or column vectors:

head(USArrests)
colSums(USArrests)
##   Murder  Assault UrbanPop     Rape 
##    389.4   8538.0   3277.0   1061.6
rowMeans(USArrests)
##        Alabama         Alaska        Arizona       Arkansas     California 
##         82.100         91.375        103.275         67.075        104.150 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##         82.150         50.350         82.925        115.575         78.550 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##         38.625         47.700         91.600         51.550         31.625 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##         51.250         46.750         88.150         35.975        101.525 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##         63.675         94.050         38.900         84.050         71.300 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##         46.100         46.200         97.800         31.150         68.550 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##         99.625         94.300        102.775         24.275         55.925 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##         61.400         65.050         49.800         68.175         90.975 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##         36.900         71.775         79.800         56.525         23.350 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##         62.050         62.050         33.750         33.100         60.850

Apply-family functions

These functions take data (a vector, list, dataframe, …) as input and apply a function to each of its elements. Let’s introduce the most basic ones:

sapply() applies a function to each element of a vector or list and returns a vector:

sapply(my_vector, function(x) {
  x ** 2
})
## [1]  1  4  9 16 25

lapply() is similar, but returns a list instead of a vector. It can also be used for dataframes, where the function is applied to each of the dataframe’s columns (a dataframe is basically a list of columns):

lapply(my_vector, function(x) {
  x ** 2
})
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 16
## 
## [[5]]
## [1] 25
lapply(mtcars, head)
## $mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1
## 
## $cyl
## [1] 6 6 4 6 8 6
## 
## $disp
## [1] 160 160 108 258 360 225
## 
## $hp
## [1] 110 110  93 110 175 105
## 
## $drat
## [1] 3.90 3.90 3.85 3.08 3.15 2.76
## 
## $wt
## [1] 2.620 2.875 2.320 3.215 3.440 3.460
## 
## $qsec
## [1] 16.46 17.02 18.61 19.44 17.02 20.22
## 
## $vs
## [1] 0 0 1 1 0 1
## 
## $am
## [1] 1 1 1 0 0 0
## 
## $gear
## [1] 4 4 4 3 3 3
## 
## $carb
## [1] 4 4 1 1 2 1

apply() takes a dataframe or matrix as input and applies a function to either rows or columns, returning a single value for each of them. Row-apply and column-apply are specified by the MARGIN parameter: 1 for rows and 2 for columns.

apply(mtcars, MARGIN = 1, mean)
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##            29.90727            29.98136            23.59818            38.73955 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##            53.66455            35.04909            59.72000            24.63455 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##            27.23364            31.86000            31.78727            46.43091 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##            46.50000            46.35000            66.23273            66.05855 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##            65.97227            19.44091            17.74227            18.81409 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##            24.88864            47.24091            46.00773            58.75273 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##            57.37955            18.92864            24.77909            24.88027 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##            60.97182            34.50818            63.15545            26.26273
apply(mtcars, MARGIN = 2, mean)
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500

Parallelization

There are several packages offering parallelization of common apply-family functions, for example parallel (part of base R), doParallel, future.apply or BiocParallel (described in more detail in this section).

Just a brief example of parallel lapply() using two CPU cores:

parallel::mclapply(my_vector, mc.cores = 2, FUN = function(x) {
  x ** 2
})

Quite the same code as for lapply(), right? Just be aware that parallelization doesn’t always bring the desired speedup, because the computation environments have to be set up first. In short, objects need to be copied to a new thread or R process, and that brings some overhead.


Introduction to the tidyverse

Until now you have probably been using only base R. The tidyverse can be seen as a dialect of R and is described as follows:

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

The tidyverse is very popular these days, but as some (here and here) have noted, it is not suitable for R beginners, meaning you should first know how base R works and only then learn the tidyverse.

Useful links

Tidy (long) data

Tidyverse packages are not meant to replace base R. In particular, they are not suited for matrix manipulation; they are specifically designed for long data (also called tidy data). More about tidy data can be found here.

The following three rules make a dataset tidy:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells.

As you will see, we will mostly be working with biological matrices (= data in wide format). For example, consider this dataframe:

(df_wide <- data.frame(sample_1 = 1:2, sample_2 = 3:4, sample_3 = 5:6, row.names = c("gene_A", "gene_B")))

For each sample (column) and each gene (row) we have its value (cell). This form of data is usable for functions operating on matrices (e.g. PCA), but as you will see later, not for e.g. plotting (besides base R plotting).
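For example, a sketch of running PCA directly on this wide data with base R’s prcomp() (transposing first, so that samples are rows):

```r
# df_wide as defined above:
df_wide <- data.frame(sample_1 = 1:2, sample_2 = 3:4, sample_3 = 5:6,
                      row.names = c("gene_A", "gene_B"))

pca <- prcomp(t(as.matrix(df_wide)))  # transpose: samples become rows
pca$x  # principal component coordinates, one row per sample
```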

This dataframe is wide, and thus it is not tidy. You can see that the rownames and colnames are, in fact, values of the variables Gene and Sample, respectively. So what would this dataframe look like in tidy format? Don’t be surprised now, we will get to this code later.

library(magrittr)
library(tidyverse)

df_long <- df_wide %>%
  tibble::rownames_to_column("Gene") %>%
  tidyr::pivot_longer(-Gene, names_to = "Sample", values_to = "Count")
df_long

Now df_long is tidy and corresponds to long format of the original dataframe. We actually did this:

Converting wide data to long format.

Why is long data useful for some purposes (mainly for plotting)? Well, for a biological matrix like the one in df_wide, you usually also have a sample sheet, for example:

(sample_sheet <- data.frame(Sample = c("sample_1", "sample_2", "sample_3"), Age = 1:3, Treatment = c("yes", "no", "yes")))

Wouldn’t it be neat to have this information alongside the long data (df_long)? I think so:

dplyr::left_join(df_long, sample_sheet, by = "Sample")

As you will see later, this long form of data is extremely useful for plotting. In this blog post, there is a nice example of using tidy data and tidyverse in genomics.

Non-standard evaluation (NSE) / tidy evaluation

Let’s jump ahead a bit and look at the select() function from the dplyr package (it will be introduced later):

dplyr::select(mtcars, cyl, mpg) %>% head()

You can see that select() is doing… well, as its name suggests, it is selecting columns of a dataframe. But it has a strange syntax, doesn’t it? We have used variables which are not defined in our current environment. Don’t be scared by this dark magic - it is based on R’s lazy evaluation (details) and it is called non-standard evaluation (NSE). In short, R only evaluates an expression if the expression is actually used. This behaviour can then be “exploited”, as in the tidyverse for tidy evaluation (which is a special type of NSE). Just a quick example:

my_fn <- function(x, y, z) {
  print("Hello!")
}

my_fn(a, b, c)
## [1] "Hello!"

You can see we are passing the non-existent variables a, b and c to my_fn(), but no errors are thrown! That is because R evaluates variables (or more generally, expressions) only when they are needed, i.e. lazily.

R also has quite advanced metaprogramming abilities, so inside a function we can do things like: “Get the name (symbol) of the variable which was passed to the function.” So, for example, you can find out that the variable named a was passed as the argument x and the expression mean(b) as the argument y:

my_fn <- function(x, y, z) {
  print(substitute(x))
  print(substitute(y))
}

my_fn(a, mean(b), c)
## a
## mean(b)

Magic! 🧙‍♂️ Having this knowledge, we can, for example, return a column from dataframe:

my_fn <- function(df, column) {
  df[, colnames(df) == substitute(column)]
}

my_fn(mtcars, cyl)
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

This is very roughly how most tidyverse functions work. They use NSE/tidy evaluation to look first at variables inside a passed object, which is usually data.frame-like (where columns = variables). Most of these functions are designed to take the data object as their first parameter. When variables are not found inside the object, the function will look for them in the current environment (this can cause unwanted problems when the object has a variable with the same name as one in the current environment, see the solution below). So in the first example above, the cyl and mpg variables are evaluated in the context of the mtcars object and the environment of the select() call (which is the global environment). This particular feature of tidyverse functions is called data-masking.

To specify that a variable comes from the current environment (i.e. not from the passed object), use the bang-bang operator !!, which forces evaluation of the variable. That means its value will be used instead of its name, preventing the data-masking behaviour.

cyl <- c(4, 6)
dplyr::filter(mtcars, cyl %in% !!cyl)
fn <- function(data, cyl) {
  dplyr::filter(data, cyl %in% !!cyl)
}

fn(mtcars, cyl)

NSE is quite an advanced topic and its knowledge is mainly needed for implementing custom functions which share the tidyverse philosophy. See vignette("programming") and friendlyeval, a friendly interface for NSE programming. I also suggest watching “Tidy evaluation in 5 mins” by Hadley Wickham, the father of the tidyverse. There is also a very practical interactive tutorial on tidy evaluation. Basic programming tips are given in the next section.

Of note, there are some base R functions which also use NSE. Some examples:

with(mtcars, cyl ** 2) %>% head()
## [1] 36 36 16 36 64 36
subset(mtcars, hp < 90)

Using dynamic variable names

Here are just quick examples how to use dynamic variable names inside custom functions.

  • For the tidyverse style (unquoted variable names), use double braces {{ variable }}:
dist_summary <- function(df, var) {
  dplyr::summarise(df, n = n(), min = min({{ var }}), max = max({{ var }}))
}
dist_summary(mtcars, mpg)
  • To use variable names as character vectors (scalars), use the .data pronoun:
dist_summary_chr <- function(df, var) {
  dplyr::summarise(df, n = n(), min = min(.data[[var]]), max = max(.data[[var]]))
}
dist_summary_chr(mtcars, "mpg")

It can also be used outside of a function:

my_var <- "mpg"
dplyr::summarise(mtcars, n = n(), min = min(.data[[my_var]]), max = max(.data[[my_var]]))

Now we will quickly look at the most popular tidyverse packages.


magrittr - a pipe operator

The pipe operator is probably the basic building block of shell programming. Fortunately for us, it is also implemented in the R package magrittr.

The pipe operator will become an official part of R in the next version, 4.1.0! Anyway, at the cost of slightly worse performance, magrittr will still offer more flexibility.

The pipe operator (see ?`%>%`) forwards the output from the left-hand side (LHS) to the input of the right-hand side (RHS). The default behaviour is to pipe the output to the first argument of the RHS function.

So instead of using nested functions like

head(rownames(mtcars))
## [1] "Mazda RX4"         "Mazda RX4 Wag"     "Datsun 710"       
## [4] "Hornet 4 Drive"    "Hornet Sportabout" "Valiant"

you can use the pipe operator:

library(magrittr)
rownames(mtcars) %>% head()
## [1] "Mazda RX4"         "Mazda RX4 Wag"     "Datsun 710"       
## [4] "Hornet 4 Drive"    "Hornet Sportabout" "Valiant"

And there are even more types of pipes.

The piped object is available on the RHS as the . object. It is used when you want:

  • To use an argument other than the first of the RHS function:
c("first", "second", "third") %>% paste("I am", ., sep = " ")
## [1] "I am first"  "I am second" "I am third"
  • To use the LHS output as a regular R object:
mtcars %>% .$cyl
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars %>% .[1:5, 1:3]
  • To use a lambda expression, which behaves like a function with a single parameter .. Don’t overthink it - just wrap the expression on the RHS in {} if you need to use nested functions, e.g. mtcars %>% {head(rownames(.))}
mtcars %>% {head(.$cyl)}
## [1] 6 6 4 6 8 6
mtcars %>% {
  my_data <- .
  c(min(my_data), max(my_data))
}
## [1]   0 472

Or you can define an anonymous function:

mtcars %>% (function(my_data) c(min(my_data), max(my_data)))
## [1]   0 472

Using arithmetic and other operators together with the pipe can be more pleasant with the provided aliases.

The add and subtract operators work fine:

1 %>% + 2 %>% - 3
## [1] 0

But for the sake of consistency, rather use operator aliases:

1 %>% add(2) %>% subtract(3)
## [1] 0

It is not possible to use some operators without aliases, e.g. == or *:

1 %>% == 2
1 %>% * 3
1 %>% equals(2)
## [1] FALSE
1 %>% multiply_by(3)
## [1] 3

In fact, operators are called infix functions and all aliases are actually shorthands for writing operators as “classic” functions, e.g. `*`(x, y). So the examples above can also be written as:

1 %>% `==`(2)
## [1] FALSE
1 %>% `*`(3)
## [1] 3

But rather don’t do that 😄

If you want to check intermediate steps in a pipe chain, you can use the debug_pipe() function. Try to run the content of this chunk in the R console (browser() doesn’t work well when invoked from chunk execution in RStudio):

mtcars %>%
  head() %>%
  debug_pipe() %>%
  dplyr::select(cyl, mpg)

debug_pipe() will send you to the debug console invoked by browser(), where you can inspect the variable passed from the LHS. When you want to continue the pipe chain execution, just type c into the console or, in RStudio, click the “Continue” button at the top of the console.


tibble - an enhanced data.frame

There are some differences between tibble and the base R’s data.frame. A comparison of these data structures can be found here and a short introduction to tibble here.

Anyway, the most visible difference is in printing of a tibble:

tibble::as_tibble(mtcars)

tibble shows the data types of variables. Also, you can see that there are no rownames - this may seem strange, but it is one of the features of tibble. The authors of the tibble package are convinced that rownames should be represented by a variable (column) and not by an object’s attribute accessible through the rownames() function.

This is problematic particularly when using the tidyverse together with Bioconductor, where rownames are an essential component. Many tidyverse functions return a tibble or drop rownames, but most of the time we want a data.frame with defined rownames instead. A tibble can be converted to a data.frame by the as.data.frame() function.

To preserve rownames in a new column, use the tibble::rownames_to_column(df, "name_of_new_column") function.
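A minimal sketch of the round trip (the column name car_name is just an illustrative choice):

```r
# Move rownames into a regular column...
mtcars_named <- tibble::rownames_to_column(head(mtcars), "car_name")
mtcars_named$car_name
## [1] "Mazda RX4"         "Mazda RX4 Wag"     "Datsun 710"
## [4] "Hornet 4 Drive"    "Hornet Sportabout" "Valiant"

# ...and back again:
mtcars_back <- tibble::column_to_rownames(mtcars_named, "car_name")
head(rownames(mtcars_back))
```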

The tribble() function is useful for a more readable creation of (small) tables within code:

my_tibble <- tibble::tribble(
  ~column_A, ~column_B,
  "a",   1,
  "b",   2,
  "c",   3
)
my_tibble

tibble is also useful for handling nested objects within lists:

my_tibble$column_C <- list(LETTERS, LETTERS, LETTERS)
my_tibble$column_D <- list(mtcars, mtcars, mtcars)
my_tibble
my_tibble$column_C[[1]]
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
head(my_tibble$column_D[[1]])

You can see that column_C’s and column_D’s data type is <list>, which can hold any other objects. This is pretty handy for nested dataframes - for example, it is possible to group rows by a variable and nest them into separate dataframes:

mtcars_nested <- mtcars %>%
  dplyr::group_by(cyl) %>%
  tidyr::nest()

mtcars_nested

So this is a nested dataframe for cyl == 4:

dplyr::filter(mtcars_nested, cyl == 4) %>%
  dplyr::pull(data) %>%
  .[[1]]

dplyr - data manipulation

dplyr is probably the most popular package from the tidyverse. It helps to solve the most common data manipulation challenges, and its sibling dbplyr also supports remote data stored in a database (MySQL, PostgreSQL, SQLite, and others).

Some functions in dplyr have names similar to functions from other packages. Thus, it is recommended to always specify the dplyr:: namespace.

We will use a shortened mtcars dataset for examples:

mtcars_sm <- head(mtcars)

select() - select (filter) columns

dplyr::select(mtcars_sm, cyl, mpg)

By using - you can also specify columns to omit:

dplyr::select(mtcars_sm, -cyl, -mpg)

We can select a range of variables. A numerical range is also supported, e.g. 1:5. Omitting a range is also possible - in this example you would use -(disp:drat).

dplyr::select(mtcars_sm, disp:drat)

There are also very useful select helpers. For omitting columns, they can also be combined with -.

Select columns starting with “c”:

dplyr::select(mtcars_sm, starts_with("c"))

Select columns containing “ar”:

dplyr::select(mtcars_sm, contains("ar"))

Select the column at the (last - 1) position:

dplyr::select(mtcars_sm, last_col(1))

Use any_of() to select columns contained in a character vector without throwing an error for a non-existing column (this can be used for dynamic column names):

dplyr::select(mtcars_sm, any_of(c("mpg", "cyl", "garbage")))

While this will fail:

dplyr::select(mtcars_sm, mpg, cyl, garbage)
## Error in `dplyr::select()`:
## ! Can't subset columns that don't exist.
## x Column `garbage` doesn't exist.

A strict version of any_of() is all_of(), which requires all columns to exist. Both of these selection functions can be combined with negative selection (-).
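A short sketch of both helpers (using mtcars_sm from above; the column name "garbage" is intentionally non-existent):

```r
mtcars_sm <- head(mtcars)
cols <- c("mpg", "cyl", "garbage")

# any_of() silently ignores "garbage"; combined with -, it drops the existing columns:
dplyr::select(mtcars_sm, -any_of(cols))

# all_of() is strict - it works here because both columns exist:
dplyr::select(mtcars_sm, all_of(c("mpg", "cyl")))
```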

select() can also select and rename columns at the same time:

dplyr::select(mtcars_sm, cylinders = cyl, horse_power = hp)

To rename specific columns and keep all the others, use the rename() function:

dplyr::rename(mtcars_sm, cylinders = cyl, horse_power = hp)

Using the everything() select helper also works, but you will lose the original column order:

dplyr::select(mtcars_sm, cylinders = cyl, horse_power = hp, everything())

Also, you can use relocate() to change the order of columns:

dplyr::relocate(mtcars_sm, cyl)

It is basically a shortcut for dplyr::select(mtcars_sm, cyl, everything()), but in addition, relocate() offers the .before and .after parameters that specify the relative position of the relocated columns:

dplyr::relocate(mtcars_sm, cyl, .after = hp)

select() always returns an object of the same type as its input. However, sometimes you need a vector instead: pull() returns a vector of the specified variable.

dplyr::pull(mtcars_sm, cyl)
## [1] 6 6 4 6 8 6

arrange() - order rows

arrange() orders rows by specified variables.

dplyr::arrange(mtcars_sm, cyl, disp)

Use desc() or - for descending order:

dplyr::arrange(mtcars_sm, dplyr::desc(cyl))
# or
# dplyr::arrange(mtcars_sm, -cyl)

filter() - filter rows

In fact, filter() is very similar to base R’s function subset(), but offers much more functionality.

Multiple arguments are equivalent to the AND operator:

dplyr::filter(mtcars_sm, hp > 100, cyl == 6)

So this is equivalent:

dplyr::filter(mtcars_sm, hp > 100 & cyl == 6)

Similarly, we can use other logical operators, e.g. OR:

dplyr::filter(mtcars_sm, hp > 100 | cyl == 6)

Thanks to NSE we can do some magic. Here, we want rows where hp is greater than the mean value of hp minus 10:

dplyr::filter(mtcars_sm, hp > mean(hp) - 10)

It is important to realize that the expression inside filter() must return a logical vector of the same length as the number of rows of the input dataframe, or a scalar TRUE or FALSE. Let’s see what happens if we violate this rule:

dplyr::filter(mtcars, c(TRUE, FALSE))
## Error in `dplyr::filter()`:
## ! Problem while computing `..1 = c(TRUE, FALSE)`.
## x Input `..1` must be of size 32 or 1, not size 2.
dplyr::filter(mtcars, 1:nrow(mtcars))
## Error in `dplyr::filter()`:
## ! Problem while computing `..1 = 1:nrow(mtcars)`.
## x Input `..1` must be a logical vector, not a integer.

mutate() - add new or modify existing variables

Named parameters of the mutate() function create new variables (or modify existing ones, if a parameter name is the same).

tibble::rownames_to_column(mtcars_sm, "car_name") %>%
  dplyr::mutate(cyl2 = cyl * 2) %>%
  dplyr::select(car_name, cyl, cyl2)

mutate() is one of the tidyverse functions that silently drop rownames. To preserve them, we have moved the rownames to the new column car_name.

This is how we remove the column and modify the existing one:

dplyr::mutate(mtcars_sm, mpg = NULL, cyl = cyl * 2)

These vectorized if functions are very useful:

if_else() checks a condition and returns values for TRUE and FALSE matches. In the missing parameter you can also specify a return value for NA values. if_else() is similar to base R’s ifelse(), but it requires both return values to be of the same type.

dplyr::mutate(mtcars_sm, car_category = if_else(hp >= 100, "strong", "weak")) %>%
  dplyr::select(hp, car_category)
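The missing parameter mentioned above can be sketched on a small vector containing an NA:

```r
hp_values <- c(50, NA, 150)
dplyr::if_else(hp_values >= 100, "strong", "weak", missing = "unknown")
## [1] "weak"    "unknown" "strong"
```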

You can see that return values must have the same type:

dplyr::mutate(mtcars_sm, car_category = if_else(hp >= 100, "strong", 50)) %>%
  dplyr::select(hp, car_category)
## Error in `dplyr::mutate()`:
## ! Problem while computing `car_category = if_else(hp >= 100, "strong",
##   50)`.
## Caused by error in `if_else()`:
## ! `false` must be a character vector, not a double vector.

case_when() allows you to specify multiple if statements. Statements are given in the format condition ~ return_value and they are matched from first to last.

For an else statement, put TRUE ~ some_value as the last statement (if it is not specified, unmatched values will get NA). As case_when() checks return types, you have to use the appropriate NA type: NA_integer_, NA_real_, NA_complex_ or NA_character_. Be careful: 1 is a real number (double type), whereas 1L is a true integer.

dplyr::mutate(
  mtcars,
  car_category = case_when(
    hp < 80 ~ "weak",
    hp >= 80 & hp <= 100 ~ "middle",
    hp > 100 ~ "strong"
  )
) %>%
  dplyr::select(hp, car_category)
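To illustrate the else branch described above, a small sketch outside of mutate() (note that NA conditions do not match, so the NA falls through):

```r
hp_values <- c(60, 90, NA)

# Without an else statement, unmatched values (here the NA) become NA:
dplyr::case_when(hp_values < 80 ~ "weak", hp_values <= 100 ~ "middle")
## [1] "weak"   "middle" NA

# With TRUE ~ ... as the catch-all else branch:
dplyr::case_when(hp_values < 80 ~ "weak", hp_values <= 100 ~ "middle", TRUE ~ "unknown")
## [1] "weak"    "middle"  "unknown"
```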

recode() replaces values, similarly to switch(). You can specify a default return value for unmatched and NA values (the .default and .missing parameters). It is very useful for replacing factor values while preserving their order.

dplyr::mutate(mtcars_sm, am_char = recode(am, `0` = "zero", `1` = "one")) %>%
  dplyr::select(am, am_char)
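The .default and .missing parameters mentioned above can be sketched like this:

```r
gears <- c("3", "4", "5", NA)
# "3" is matched explicitly, "4" and "5" fall back to .default, NA to .missing:
dplyr::recode(gears, `3` = "three", .default = "other", .missing = "none")
## [1] "three" "other" "other" "none"
```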

recode_factor() creates factors with levels ordered as they appear in the recode call.

factor(mtcars_sm$am) %>%
  levels()
## [1] "0" "1"
dplyr::mutate(
  mtcars_sm,
  am = factor(am),
  am_char = recode_factor(am, `1` = "one", `0` = "zero")
) %>%
  dplyr::pull(am_char) %>%
  levels()
## [1] "one"  "zero"

Notice that mutate() parameters are evaluated from first to last. So am_char already receives am as a factor, because am is mutated before am_char.

Of course you can use all of these functions outside of mutate():

v <- c(1, 1, 2, 3, 10)
if_else(v < 2, "< 2", ">= 2")
## [1] "< 2"  "< 2"  ">= 2" ">= 2" ">= 2"
(am_factor <- factor(mtcars_sm$am))
## [1] 1 1 1 0 0 0
## Levels: 0 1
recode(am_factor, `0` = "zero", `1` = "one")
## [1] one  one  one  zero zero zero
## Levels: zero one
recode_factor(am_factor, `1` = "one", `0` = "zero")
## [1] one  one  one  zero zero zero
## Levels: one zero

group_by() and summarise()

group_by() is similar to the SQL operation of the same name. It takes an existing dataframe and converts it into a grouped dataframe in which operations are performed “by group”.

by_cyl <- dplyr::group_by(mtcars, cyl)
by_cyl

It is better to view the tibble in your console, where information about the groups is displayed.

Now we can perform operations on groups. summarise() reduces multiple values down to a single value. Let’s calculate some statistics in the cyl groups:

dplyr::summarise(
  by_cyl,
  n = n(),
  mpg_mean = mean(mpg),
  hp_max = max(hp),
  qsec_median = median(qsec),
  wt_sd = sd(wt)
)

n() is a special function to get counts inside groups.

filter() now acts on the groups. Let’s compare these two filter operations:

dplyr::filter(mtcars, mpg > mean(mpg))
dplyr::filter(by_cyl, mpg > mean(mpg))

The former keeps rows with mpg greater than the global average whereas the latter acts within the cyl groups, and, thus, keeps rows with mpg greater than the average inside the cyl groups.

To remove a grouping, use the ungroup() function.
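Forgetting to ungroup is a common source of surprises - e.g. a later mutate() would still operate per group. A small sketch (is_grouped_df() just checks the grouping status):

```r
by_cyl <- dplyr::group_by(mtcars, cyl)
with_means <- dplyr::mutate(by_cyl, mpg_group_mean = mean(mpg))  # mean() runs per cyl group
dplyr::is_grouped_df(with_means)
## [1] TRUE
dplyr::is_grouped_df(dplyr::ungroup(with_means))
## [1] FALSE
```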

*_join() and bind_rows() - joining dataframes

A collection of *_join() functions acts in the same way as in SQL - based on common keys, they join two dataframes together (column-wise).

See these beautiful animations of joins and this chapter in R for Data Science for more explanation.

mtcars_prices <- tibble::tribble(
  ~cyl, ~price,
  4,    1000,
  6,    5000,
  8,    10000
)

dplyr::left_join(mtcars_sm, mtcars_prices, by = "cyl") %>%
  dplyr::select(cyl, price)

bind_rows() appends rows to a dataframe. It fills columns not present in both dataframes with NAs.

dplyr::bind_rows(mtcars_sm, mtcars_prices) %>%
  dplyr::select(cyl, price, everything())

tidyr - tools for tidy data

The goal of tidyr is to help you to create tidy data. Just to remind you, tidy data is data where:

  • Every column is a variable.
  • Every row is an observation.
  • Every cell is a single value.

Cheatsheet on reading and tidying data.

Let’s look at some of the basic tasks tidyr can do:

Pivoting

Pivoting converts between long and wide forms. We will reuse the wide dataframe df_wide.

pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns.

df_wide
df_long <- df_wide %>%
  tibble::rownames_to_column("Gene") %>%
  tidyr::pivot_longer(-Gene, names_to = "Sample", values_to = "Count")
df_long

We have added the rownames of df_wide to the new column Gene. In pivot_longer(), by -Gene we specified that all other columns should be lengthened.

Of course more than one column can be lengthened. Let’s add more columns to df_wide:

df_wide2 <- df_wide %>%
  tibble::rownames_to_column("Gene_symbol") %>%
  dplyr::mutate(
    Gene_ID = c("ENSG0001", "ENSG0002"),
    Gene_class = c("protein_coding", "miRNA")
  )

df_wide2
df_long2 <- df_wide2 %>%
  tidyr::pivot_longer(-c(Gene_symbol, Gene_ID, Gene_class), names_to = "Sample", values_to = "Count")

df_long2

We can see that the gene columns share the same prefix, and we can utilize that: select helpers are allowed here.

df_wide2 %>%
  tidyr::pivot_longer(-starts_with("Gene_"), names_to = "Sample", values_to = "Count")

We have omitted all columns starting with “Gene_”, thus lengthening all the other columns.

pivot_wider() “widens” data, increasing the number of columns and decreasing the number of rows.

tidyr::pivot_wider(df_long, names_from = "Sample", values_from = "Count")
tidyr::pivot_wider(df_long2, names_from = "Sample", values_from = "Count")

Splitting and combining character columns

separate() turns a single character column into multiple columns.

(df <- data.frame(x = c(NA, "a.b", "a.d", "b.c")))
tidyr::separate(df, x, c("A", "B"), sep = "\\.")

extract() turns each capturing group into a new column.

(df <- data.frame(x = c(NA, "a-1-x", "a-2-y", "a-3")))
tidyr::extract(df, x, c("A", "B", "C"), regex = "([a-z])-(\\d)-?([a-z])?")

unite() pastes together multiple columns into one.

(df <- data.frame(x = c(NA, "a", "b", "c"), y = c("a", "f", "g", "h")))
tidyr::unite(df, "x_y", x, y, sep = "_", remove = FALSE)

Other tidyr functions

drop_na() drops rows containing missing values.

(df <- data.frame(x = c(1, NA, 2), y = c(1, 2, NA)))
tidyr::drop_na(df)
tidyr::drop_na(df, x)

replace_na() replaces missing values.

For dataframes, specify a list of columns and replacements.

tidyr::replace_na(df, list(x = 5, y = 10))

For vectors, specify a replacement value.

tidyr::replace_na(c(1, NA, 3), 2)
## [1] 1 2 3

stringr - consistent wrappers for common string operations

stringr is a nice replacement for the base R collection of string manipulation functions (e.g. grep(), gsub(), etc., see ?grep). In my opinion, the stringr API is more user-friendly. All its functions start with the str_ prefix and are vectorised over both the input string and the pattern.

Useful links

  • Cheatsheet
  • RegExplain - RStudio addin providing a friendly interface for working with regular expressions and functions from stringr.
  • Regex101 - cool web application for regex testing.

Because the stringr homepage offers a nice collection of examples, I will just copy them here and possibly add more information.

library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")
str_length(x)
## [1] 3 5 5 5 4 9

str_c(…, sep = "", collapse = NULL) concatenates a vector of strings.

str_c(x, collapse = ", ")
## [1] "why, video, cross, extra, deal, authority"
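The sep argument works element-wise, while collapse joins the resulting vector into a single string; a quick sketch:

```r
stringr::str_c("sample", 1:3, sep = "_")
## [1] "sample_1" "sample_2" "sample_3"
stringr::str_c("sample", 1:3, sep = "_", collapse = ", ")
## [1] "sample_1, sample_2, sample_3"
```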

str_sub(string, start = 1L, end = -1L) extracts a substring by position. Positions can also be negative (counted from the end).

str_sub(x, 1, 2)
## [1] "wh" "vi" "cr" "ex" "de" "au"
str_sub(x, -2)
## [1] "hy" "eo" "ss" "ra" "al" "ty"

str_detect(string, pattern, negate = FALSE) tells you if there’s any match to the pattern. Handy in dplyr::filter().

str_detect(x, "[aeiou]")
## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
str_detect(x, "[aeiou]", negate = TRUE)
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE

str_count(x, pattern) counts the number of pattern matches.

str_count(x, "[aeiou]")
## [1] 0 3 1 2 2 4

str_subset(string, pattern, negate = FALSE) extracts the matching components. str_which(string, pattern, negate = FALSE) gives indices of the matching components.

str_subset(x, "[aeiou]")
## [1] "video"     "cross"     "extra"     "deal"      "authority"
str_subset(x, "[aeiou]", negate = TRUE)
## [1] "why"
str_which(x, "[aeiou]")
## [1] 2 3 4 5 6

str_locate(string, pattern) gives the position of the match.

str_locate(x, "[aeiou]")
##      start end
## [1,]    NA  NA
## [2,]     2   2
## [3,]     3   3
## [4,]     1   1
## [5,]     2   2
## [6,]     1   1

str_extract(string, pattern) extracts the text of the match and returns a vector.

x
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
str_extract(x, "[aeiou]")
## [1] NA  "i" "o" "e" "e" "a"

str_match(string, pattern) extracts parts of the match defined by parentheses (capturing groups). The return value is a matrix whose rows correspond to the input vector values and whose columns correspond to the capturing groups. The first column represents the full match.

x
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
str_match(x, "(.)[aeiou](.)")
##      [,1]  [,2] [,3]
## [1,] NA    NA   NA  
## [2,] "vid" "v"  "d" 
## [3,] "ros" "r"  "s" 
## [4,] NA    NA   NA  
## [5,] "dea" "d"  "a" 
## [6,] "aut" "a"  "t"

str_replace(string, pattern, replacement) replaces the matches with new text. replacement also supports references of the form \1, \2, etc., which will be replaced with the contents of the respective matched group (created by ()). str_remove(string, pattern) is an alias for str_replace(string, pattern, "").

str_replace(x, "[aeiou]", "?")
## [1] "why"       "v?deo"     "cr?ss"     "?xtra"     "d?al"      "?uthority"
str_replace_all(x, "[aeiou]", "?")
## [1] "why"       "v?d??"     "cr?ss"     "?xtr?"     "d??l"      "??th?r?ty"
str_replace(x, "(a)", "\\1_")
## [1] "why"        "video"      "cross"      "extra_"     "dea_l"     
## [6] "a_uthority"
str_remove(x, "[aeiou]")
## [1] "why"      "vdeo"     "crss"     "xtra"     "dal"      "uthority"
str_remove_all(x, "[aeiou]")
## [1] "why"   "vd"    "crss"  "xtr"   "dl"    "thrty"

str_split(string, pattern, n = Inf, simplify = FALSE) splits up a string into multiple pieces.

str_split(c("a,b", "c,d,e"), ",")
## [[1]]
## [1] "a" "b"
## 
## [[2]]
## [1] "c" "d" "e"
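With simplify = TRUE, str_split returns a character matrix instead of a list; to my understanding, shorter splits are padded with empty strings:

```r
stringr::str_split(c("a,b", "c,d,e"), ",", simplify = TRUE)
##      [,1] [,2] [,3]
## [1,] "a"  "b"  ""
## [2,] "c"  "d"  "e"
```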

Regular expressions

A regular expression (shortened as regex or regexp) is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation.

Regexes are essential to text processing (not only in R) and you should at least know their basics. For example, do you know what the regex in str_replace(x, "[aeiou]", "?") above means?

To check for a regex explanation, you can use e.g. regex101.com. There is also a handy RStudio addin RegExplain.

An example from regex101.com

Also, there are several packages offering a grammar for regexes, such as rex and RVerbalExpressions. They allow you to construct regexes using a sort of human-readable language, to solve situations like “find something that must contain two words” (the code below is from the RVerbalExpressions examples):

rx() %>% 
  rx_find("cat") %>% 
  rx_anything() %>% 
  rx_find("dog")

Just a note - the backslash (\) character is used in R (and also in other programming languages) for escaping. In short, a character after a backslash has a special meaning (e.g. \n is used for a newline). Because regexes themselves use backslashes (e.g. \d means “any digit”), you have to tell R to treat the backslash as a normal character, and not as the start of an escape sequence. This is done by using a double backslash (\\). Just keep in mind that sites like regex101.com expect single backslashes in regexes.
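A tiny sketch of the difference (writeLines() shows the string as R actually stores it):

```r
# "\\d" in R source code is the two-character regex \d ("any digit"):
writeLines("\\d")
## \d
stringr::str_detect("R 4.1.0", "\\d")
## [1] TRUE
stringr::str_detect("no digits here", "\\d")
## [1] FALSE
```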


glue - string interpolation

glue offers the same functionality you already know from e.g. Bash ("$variable") or Python (f"{variable}"). It is much more comfortable than e.g. paste().

library(glue)
name <- "Fred"
age <- 50
anniversary <- as.Date("1991-10-12")
glue("My name is {name}, my age next year is {age + 1}, my anniversary is {format(anniversary, '%A, %B %d, %Y')}.")
## My name is Fred, my age next year is 51, my anniversary is Saturday, October 12, 1991.

Compare this code to implementation using paste0():

paste0("My name is ", name, ", my age next year is ", age + 1, ", my anniversary is ", format(anniversary, '%A, %B %d, %Y'), ".")
## [1] "My name is Fred, my age next year is 51, my anniversary is Saturday, October 12, 1991."

Personally, I think it’s quite confusing and prone to syntax errors.

Equivalently, in glue() we can break a long string into multiple parameters on separate lines and use named arguments to assign temporary variables:

glue(
  "My name is {name},",
  " my age next year is {age + 1},",
  " my anniversary is {format(anniversary, '%A, %B %d, %Y')}.",
  name = "Joe",
  age = 40,
  anniversary = as.Date("2001-10-12")
)

Keep in mind that glue() is vectorized just like paste():

my_names <- c("John", "Adam", "Peter")
paste0("His name is ", my_names, ".")
## [1] "His name is John."  "His name is Adam."  "His name is Peter."
glue("His name is {my_names}.")
## His name is John.
## His name is Adam.
## His name is Peter.

Don’t be confused by the different appearance of glue strings when you print them. Internally, they are the same as character vectors.


purrr - functional programming tools

purrr is basically a handy replacement for lapply() and sapply(), but if you go into more depth, it offers much more functionality. We will just look at the most basic functions.

map() is similar to lapply(), but allows you to use a special formula notation for functions. In this formula, the iterated variable is referenced by . (dot). In the example below, ~ . + 1 is equivalent to function(x) {x + 1}.

my_list <- list(first = 1:3, second = 1:10, third = 20)
my_list
## $first
## [1] 1 2 3
## 
## $second
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $third
## [1] 20
purrr::map(my_list, ~ . + 1)
## $first
## [1] 2 3 4
## 
## $second
##  [1]  2  3  4  5  6  7  8  9 10 11
## 
## $third
## [1] 21

map() or pluck() can be used to retrieve a specific subelement of each element. Here, we return the first element of each nested vector:

purrr::map(my_list, 1)
## $first
## [1] 1
## 
## $second
## [1] 1
## 
## $third
## [1] 20

Similarly, if nested lists or vectors are named, we can use their names instead of integer positions.

my_named_list <- list(
  list(first = 1:2, second = "Hello", third = 20),
  list(first = 5, second = 1)
)
my_named_list
## [[1]]
## [[1]]$first
## [1] 1 2
## 
## [[1]]$second
## [1] "Hello"
## 
## [[1]]$third
## [1] 20
## 
## 
## [[2]]
## [[2]]$first
## [1] 5
## 
## [[2]]$second
## [1] 1
purrr::map(my_named_list, "second")
## [[1]]
## [1] "Hello"
## 
## [[2]]
## [1] 1

map() always returns a list, while its map_<type>() variants return a vector of the specified type (e.g. character - map_chr()).

purrr::map_dbl(my_list, mean)
##  first second  third 
##    2.0    5.5   20.0

Using a predicate function, keep() and discard() will filter elements:

purrr::keep(my_list, ~ all(. < 5))
## $first
## [1] 1 2 3
purrr::discard(my_list, ~ all(. < 5))
## $second
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $third
## [1] 20
purrr::keep(my_named_list, ~ length(.) > 2)
## [[1]]
## [[1]]$first
## [1] 1 2
## 
## [[1]]$second
## [1] "Hello"
## 
## [[1]]$third
## [1] 20

ggplot2 - plotting made easy

ggplot2 is a package for declaratively creating graphics based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

ggplot2 scheme

To create a plot, you start by initializing a basic ggplot2 object holding the data, and then you add a geom, which tells ggplot2 how to visualize the data and how to map variables to aesthetics (color, size, etc.). This provides you with sensible defaults. To further modify the plot, you can add more components with the + operator.

Useful links

For professional tips regarding mainly, but not only, scientific visualizations, I would recommend Fundamentals of Data Visualization. This book can really help you to design figures in your papers.

ggplot2 is a very extensive package, and so you should mainly know its basic principles and search for details in the reference. We will just look at these basic principles, using the mtcars dataframe.

library(ggplot2)
library(ggcats)
head(mtcars)
ggplot(mtcars) +
  geom_cat(aes(x = cyl, y = disp), cat = "grumpy")

A professional figure, right? 😁

We have initialized the ggplot2 object with the mtcars data and added geom_cat(), which represents a visualization of points (cats) in a coordinate system. In this geom, we mapped x and y to the cyl and disp variables, respectively.

Each geom_*() has a set of available aesthetics to which you can map variables. Let’s scale the cat size by the carb variable:

ggplot(mtcars) +
  geom_cat(aes(x = cyl, y = disp, size = carb), cat = "grumpy")

Hmmm, but it seems cyl is considered a continuous variable… well, that is how it is defined:

head(mtcars$cyl)
typeof(mtcars$cyl)
## [1] 6 6 4 6 8 6
## [1] "double"

ggplot2 has very sensible defaults, so, for example, in the case of colors it uses a continuous or discrete scale based on the variable type. Let’s see what happens when we change the type of cyl to factor:

mtcars2 <- dplyr::mutate(mtcars, cyl = factor(cyl))
head(mtcars2$cyl)
## [1] 6 6 4 6 8 6
## Levels: 4 6 8
ggplot(mtcars2) +
  geom_cat(aes(x = cyl, y = disp, size = carb), cat = "grumpy")

See how the x-axis changed, as cyl is now a factor variable.

Because ggplot2 is a part of the tidyverse, it also uses non-standard evaluation, and so this is equivalent to the example above, but without modifying the mtcars object:

ggplot(mtcars) +
  geom_cat(aes(x = factor(cyl), y = disp, size = carb), cat = "grumpy")

OK, time for a more serious example 🙂 Instead of geom_cat(), we will use the more practical geom_point(), which shows classic points instead of cats, and map carb to point colors:

ggplot(mtcars2) +
  geom_point(aes(x = cyl, y = disp, color = carb))

The colors seem to be on a continuous scale, right? We would rather have a discrete scale. Let’s fix it:

ggplot(mtcars2) +
  geom_point(aes(x = cyl, y = disp, color = factor(carb)))

Now you can see that ggplot2 has created a discrete scale, where each level of the carb variable has its own color.

Aesthetics are inherited and overwritten by the last definition. This way you don’t have to repeat common aesthetics for each geom. This is equivalent to the example above:

ggplot(mtcars2, aes(x = cyl, y = disp, color = factor(carb))) +
  geom_point()

If you add more geoms to the plot above, they inherit aesthetics defined in ggplot(), unless you specify inherit.aes = FALSE.

Let’s introduce more ggplot2 components:

  • We map hp variable to point size in geom_point().
  • We add one more geom, geom_text(), which is similar to geom_point(), but shows text instead of points.
  • scale_color_manual() translates discrete values (factor levels) to colors. In general, scales control the details of how data values are translated to visual properties.
  • labs() adds labels to aesthetics and main plot areas (e.g. title).
  • theme_bw() is a predefined theme with black and white colors.
  • theme() customizes the non-data components of your plots. We can override settings in predefined theme_*() themes.
p <- ggplot(mtcars, aes(x = qsec, y = mpg)) +
  geom_point(aes(color = factor(gear), size = hp)) +
  geom_text(aes(label = rownames(mtcars)), color = "black") +
  scale_color_manual(values = c("3" = "orchid4", "4" = "darkcyan", "5" = "darkorange1")) +
  labs(
    title = "mtcars", subtitle = "Acceleration vs. Fuel consumption",
    x = "1/4 mile time", y = "Miles per gallon", size = "Horse Power", color = "Gear"
  ) +
  theme_bw() +
  theme(
    plot.title = element_text(color = "chocolate4", face = "bold"),
    plot.subtitle = element_text(color = "orangered3"),
    axis.title = element_text(face = "bold")
  )

print(p)

You can see we saved our plot to the p variable. Anytime later, you can modify this variable (plot) by adding more geoms or theme settings.

Another useful ability of ggplot2 is faceting. facet_wrap() splits a plot into panels according to a discrete variable, which is specified by a formula. We split the plot by the am variable (transmission: 0 = automatic, 1 = manual) and define a translation of the labels for its levels.

mtcars$am
##  [1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
p + facet_wrap(~ am, labeller = labeller(am = c("0" = "Automatic", "1" = "Manual")))

We can specify multiple variables, and a panel is then created for every combination of their levels.

p + facet_wrap(~ am + cyl, labeller = labeller(am = c("0" = "Automatic", "1" = "Manual")))

A similar function is facet_grid(), which creates a matrix of plots. Rows and columns are specified by the LHS and RHS of a formula.

p + facet_grid(am ~ cyl, labeller = labeller(am = c("0" = "Automatic", "1" = "Manual")))

Again, multiple variables are supported, for both LHS and RHS.

p + facet_grid(am + vs ~ cyl, labeller = labeller(am = c("0" = "Automatic", "1" = "Manual")))

Plots are exported to image files via the ggsave() function. You can use any format R can handle.

ggsave(here("E02-intro_to_R/ggplot2_example.png"), p)
ggsave(here("E02-intro_to_R/ggplot2_example.pdf"), p)

ggplot2_example.png

ggplot2_example.pdf

Alternatively, you can use base R functions. This is useful for saving multiple plots to a single multipage PDF.

pdf("ggplot2_example.pdf")
print(p)
dev.off()

Using dynamic variable names in aesthetics

You have probably wondered how you can use dynamic variable names in aes(). For example, consider a function plot_pca(data, color_by) which calculates a PCA and shows a scatterplot colorized by a specified variable. The easiest way is to use the aes_string() function:

color_by <- "cyl"

ggplot(mtcars) +
  geom_point(aes_string(x = "mpg", y = "qsec", color = color_by))

The disadvantage is that you cannot use the capabilities of NSE, so in the example above, you cannot use factor(color_by) to treat cyl as a factor variable.

But because ggplot2 is a part of the tidyverse, you can use things described in the Using dynamic variable names section, such as the double brackets {{ }}:

fn <- function(df, color_by) {
  ggplot(df, aes(x = am, y = mpg, color = {{ color_by }})) +
    geom_point()
}

fn(mtcars, factor(cyl))

Packages extending the ggplot2

patchwork - the composer of plots

patchwork makes it ridiculously simple to combine separate ggplots into a single graphic (called a “patchwork”). In short, it defines arithmetic for ggplots and several functions to place plots into a grid.

We will mainly use examples provided by patchwork itself. First, we create some plots:

library(patchwork)
p1 <- ggplot(mtcars) + geom_point(aes(mpg, disp))
p2 <- ggplot(mtcars) + geom_boxplot(aes(gear, disp, group = gear))
p3 <- ggplot(mtcars) + geom_smooth(aes(disp, qsec))
p4 <- ggplot(mtcars) + geom_bar(aes(carb))

Now we can combine them to a single figure:

p1 + p2

p1 / p2

+ and / combine plots into a row and a column, respectively. However, by default, patchwork will try to keep the grid square and fill it in row order:

p1 + p2 + p3 + p4

We can control this with plot_layout():

p1 + p2 + p3 + p4 + plot_layout(nrow = 3, byrow = FALSE)

Each combined plot can itself be combined with any other ggplot or combined plot (i.e. you can create nested patchworks):

p1 + (p2 / p3)

p1 / (p2 + p3)

A patchwork can be annotated and its individual plots tagged:

(p1 | (p2 / p3)) + 
  plot_annotation(title = "The surprising story about mtcars", tag_levels = "A")

If you just need to make a grid of plots, use wrap_plots():

# Or you can pass a list of plots:
# plots <- list(p1, p2, p3, p4) 
# wrap_plots(plots)
wrap_plots(p1, p2, p3, p4)

wrap_plots(p1, p2, p3, p4, nrow = 1)

ggpubr - publication-ready plots

ggpubr::ggboxplot(
  mtcars, x = "cyl", y = "qsec",
  color = "cyl", shape = "cyl",
  add = "jitter"
)

GGally - useful geoms

GGally::ggpairs(mtcars, columns = c("cyl", "mpg", "qsec"))

ggforce - cool visualizations

ggplot(mtcars) +
  geom_point(aes(x = hp, y = qsec, color = factor(gear))) +
  ggforce::facet_zoom(x = gear == 4) +
  theme_bw()

ggcorrplot - visualize a correlation matrix

corr <- round(cor(mtcars), 1)
corr[1:5, 1:5]
##       mpg  cyl disp   hp drat
## mpg   1.0 -0.9 -0.8 -0.8  0.7
## cyl  -0.9  1.0  0.9  0.8 -0.7
## disp -0.8  0.9  1.0  0.8 -0.7
## hp   -0.8  0.8  0.8  1.0 -0.4
## drat  0.7 -0.7 -0.7 -0.4  1.0
p_mat <- ggcorrplot::cor_pmat(mtcars)
p_mat[1:5, 1:5]
##               mpg          cyl         disp           hp         drat
## mpg  0.000000e+00 6.112687e-10 9.380327e-10 1.787835e-07 1.776240e-05
## cyl  6.112687e-10 0.000000e+00 1.802838e-12 3.477861e-09 8.244636e-06
## disp 9.380327e-10 1.802838e-12 0.000000e+00 7.142679e-08 5.282022e-06
## hp   1.787835e-07 3.477861e-09 7.142679e-08 0.000000e+00 9.988772e-03
## drat 1.776240e-05 8.244636e-06 5.282022e-06 9.988772e-03 0.000000e+00
p_corr <- ggcorrplot::ggcorrplot(corr, method = "circle", p.mat = p_mat)
print(p_corr)

ggrepel - repel overlapping text labels

ggplot(mtcars2, aes(x = qsec, y = mpg)) +
  geom_point(aes(color = factor(gear), size = hp)) +
  ggrepel::geom_text_repel(aes(label = rownames(mtcars)), color = "black")

Additional themes

ggsci - a collection of color palettes inspired by (mainly) scientific journals

Nature Publishing Group color palette as an example:

p1 <- ggplot(mtcars) +
  geom_point(aes(x = hp, y = qsec, color = factor(cyl))) +
  theme_bw() +
  ggsci::scale_color_npg()

p2 <- ggplot(mtcars) +
  geom_histogram(aes(x = qsec, fill = factor(cyl)), colour = "black", binwidth = 1, position = "dodge") +
  theme_bw() +
  ggsci::scale_fill_npg()

p1 + p2

hrbrthemes - themes and color palettes

ggplot(mtcars) +
  geom_point(aes(x = hp, y = qsec, color = factor(cyl))) +
  hrbrthemes::theme_modern_rc()

see - themes and color palettes

ggplot(mtcars) +
  geom_point(aes(x = hp, y = qsec, color = factor(cyl))) +
  see::theme_abyss()

cowplot

cowplot is not really a collection of e.g. colors, but it provides useful visual helpers, each with an explanation.

Other useful packages

janitor - table summaries, duplicates etc.

Personally, I mainly use these functions:

tabyl() (vignette) creates a frequency table. It is a tidy replacement for base R’s table():

table(mtcars$cyl)
## 
##  4  6  8 
## 11  7 14
janitor::tabyl(mtcars, cyl)
table(mtcars$cyl, mtcars$gear)
##    
##      3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2
janitor::tabyl(mtcars, cyl, gear)
table(mtcars$cyl, mtcars$gear, mtcars$am)
## , ,  = 0
## 
##    
##      3  4  5
##   4  1  2  0
##   6  2  2  0
##   8 12  0  0
## 
## , ,  = 1
## 
##    
##      3  4  5
##   4  0  6  2
##   6  0  2  1
##   8  0  0  2
janitor::tabyl(mtcars, cyl, gear, am)
## $`0`
##  cyl  3 4 5
##    4  1 2 0
##    6  2 2 0
##    8 12 0 0
## 
## $`1`
##  cyl 3 4 5
##    4 0 6 2
##    6 0 2 1
##    8 0 0 2

The adorn_*() functions append rows and/or columns with summary statistics:

janitor::tabyl(mtcars, cyl, gear) %>%
  janitor::adorn_totals(c("row", "col")) %>%
  janitor::adorn_percentages("row") %>%
  janitor::adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  janitor::adorn_ns() %>%
  as.data.frame()

get_dupes() returns duplicated rows:

janitor::get_dupes(mtcars, mpg, hp)

plotly - interactive HTML plots

plotly is a JavaScript library with wrappers for R and Python. It is somewhat similar to ggplot2 in terms of the Grammar of Graphics. The output is an HTML file, or, in the case of RMarkdown, it can be embedded directly as chunk output.

We won’t go into details, so if you are interested, see the examples. We will try to replicate the mtcars scatterplot made earlier in ggplot2:

p_ggplot2 <- ggplot(mtcars2, aes(x = qsec, y = mpg)) +
  geom_point(aes(color = factor(gear), size = hp)) +
  theme_bw()

print(p_ggplot2)

p_plotly <- plotly::plot_ly(
  data = mtcars,
  x = ~qsec, y = ~mpg, color = ~factor(gear), size = ~hp,
  type = "scatter", mode = "markers",
  text = ~glue("mpg: {mpg}<br>qsec: {qsec}<br>gear: {gear}<br>hp: {hp}")
)

p_plotly

plotly can also convert ggplot2 objects, but it is not always accurate:

plotly::ggplotly(p_ggplot2)

But this seems to be quite good, even better than p_plotly ☺️

plotly produces objects inheriting from the htmlwidget class. We can use the htmlwidgets package to save a plotly object to an HTML file:

class(p_plotly)
## [1] "plotly"     "htmlwidget"
htmlwidgets::saveWidget(p_plotly, here("E02-intro_to_R/p_plotly.html"))

p_plotly.html

heatmaply - interactive HTML heatmaps

heatmaply uses plotly underneath. Its API is mostly compatible with the classic heatmap.2() function.

heatmaply::heatmaply(mtcars)

It also has functionality similar to ggcorrplot:

cor <- psych::corr.test(mtcars)

heatmaply::heatmaply_cor(
  cor$r,
  node_type = "scatter",
  point_size_mat = -log10(cor$p), 
  point_size_name = "-log10(p-value)",
  label_names = c("x", "y", "Correlation")
)

ComplexHeatmap

ComplexHeatmap is probably the most complex package for heatmap visualization, while still providing a very user-friendly approach.

ComplexHeatmap was initially created for complex multi-omics datasets, combining e.g. gene expression with methylation levels. Its design follows an object-oriented scheme, so the heatmap body, annotations, legends, etc. are self-contained building blocks. It can do a lot, which can be a bit overwhelming, but for simple heatmaps (e.g. z-scores of logcounts), unlike heatmap.2 and similar packages, ComplexHeatmap is user-friendly and provides great assistance.
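As a taste of the API, here is a minimal sketch of a z-score heatmap of mtcars (the function and arguments are from ComplexHeatmap, but the aesthetic choices are just an example):

```r
# Scale columns to z-scores (base R).
mat <- scale(as.matrix(mtcars))

# Heatmap() clusters rows and columns and draws a legend automatically.
ComplexHeatmap::Heatmap(
  mat,
  name = "z-score",       # legend title
  show_row_names = FALSE  # 32 car names would be cramped
)
```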

Some time ago, I created an introduction for my labmates, which is available in this GitHub repository, but you can find all the necessary files in E02-intro_to_R/ComplexHeatmap-intro: there is a source Rmd file (ComplexHeatmap.Rmd) and its rendered HTML document (ComplexHeatmap.html).

Although there are many popular heatmap packages, I will deliberately not mention them here, as they have many drawbacks. In my opinion, ComplexHeatmap is currently the best heatmap package, especially for genomics and transcriptomics.

BiocParallel

The BiocParallel package provides easy-to-use functions for parallel execution. We won’t go into details and will just introduce probably its most used function, bplapply(), a parallelized version of lapply().

As you know, lapply(X, FUN, ...) applies a function FUN with optional arguments ... to each element of X, returning a list of the values returned by FUN:

x <- 1:5
fn <- function(i) i^2
lapply(x, fn)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 16
## 
## [[5]]
## [1] 25

bplapply(X, FUN, ..., BPREDO = list(), BPPARAM = bpparam()) does the same, but the elements of X are first split into chunks, which are processed in parallel. The result is guaranteed to preserve the order of elements in X.

BiocParallel::bplapply(x, fn)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 16
## 
## [[5]]
## [1] 25

By default, all available cores are used. To control resource usage, use the BPPARAM argument:

(bpparam <- BiocParallel::MulticoreParam(workers = 2))
## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
##   bpexportglobals: TRUE; bpforceGC: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK
BiocParallel::bplapply(x, fn, BPPARAM = bpparam)

Many functions in Bioconductor packages offer an argument for passing a BPPARAM object. If you are working on a high-performance server, this can speed up computations a lot.
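If you don’t want to pass BPPARAM everywhere, you can set the default backend once with register(); subsequent bplapply() calls (and functions defaulting to bpparam()) will pick it up. A sketch, assuming BiocParallel is installed:

```r
library(BiocParallel)

# Make a 2-worker multicore backend the session default.
register(MulticoreParam(workers = 2))

# No BPPARAM argument needed now; bpparam() returns the registered backend.
unlist(bplapply(1:4, function(i) i^2))
```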

Parallel processing is hard to debug, even with bptry(), the try/catch equivalent. Also, you can’t use browser() inside FUN. But sometimes it is enough to run bplapply() in non-parallel (serial) mode. This is possible with SerialParam():

(bpparam <- BiocParallel::SerialParam())
fn <- function(i) {
  browser()
  i^2
}

BiocParallel::bplapply(x, fn, BPPARAM = bpparam)

BiocParallel supports jobs in distributed systems and much more, but for now we can get by with simple multicore usage.


Cleanup

save(list = ls(all.names = TRUE), file = here("E02-intro_to_R/intro_to_R.RData"), envir = environment())

warnings()
traceback()
## No traceback available
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Debian GNU/Linux 10 (buster)
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Europe/Prague
##  date     2022-02-16
##  pandoc   2.11.2 @ /usr/lib/rstudio-server/bin/pandoc/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package          * version    date (UTC) lib source
##  abind              1.4-5      2016-07-21 [1] CRAN (R 4.1.2)
##  AnnotationDbi      1.56.2     2021-11-09 [1] Bioconductor
##  assertthat         0.2.1      2019-03-21 [1] CRAN (R 4.1.2)
##  backports          1.4.1      2021-12-13 [1] CRAN (R 4.1.2)
##  Biobase            2.54.0     2021-10-26 [1] Bioconductor
##  BiocGenerics       0.40.0     2021-10-26 [1] Bioconductor
##  BiocParallel       1.28.3     2021-12-09 [1] Bioconductor
##  Biostrings         2.62.0     2021-10-26 [1] Bioconductor
##  bit                4.0.4      2020-08-04 [1] CRAN (R 4.1.2)
##  bit64              4.0.5      2020-08-30 [1] CRAN (R 4.1.2)
##  bitops             1.0-7      2021-04-24 [1] CRAN (R 4.1.2)
##  blob               1.2.2      2021-07-23 [1] CRAN (R 4.1.0)
##  bookdown           0.24       2021-09-02 [1] CRAN (R 4.1.1)
##  broom              0.7.12     2022-01-28 [1] CRAN (R 4.1.2)
##  bslib              0.3.1      2021-10-06 [1] CRAN (R 4.1.2)
##  cachem             1.0.6      2021-08-19 [1] CRAN (R 4.1.1)
##  car                3.0-12     2021-11-06 [1] CRAN (R 4.1.2)
##  carData            3.0-5      2022-01-06 [1] CRAN (R 4.1.2)
##  cellranger         1.1.0      2016-07-27 [1] CRAN (R 4.1.2)
##  cli                3.1.1      2022-01-20 [1] CRAN (R 4.1.2)
##  codetools          0.2-18     2020-11-04 [1] CRAN (R 4.1.2)
##  colorspace         2.0-2      2021-06-24 [1] CRAN (R 4.1.0)
##  crayon             1.4.2      2021-10-29 [1] CRAN (R 4.1.2)
##  crosstalk          1.2.0      2021-11-04 [1] CRAN (R 4.1.2)
##  data.table         1.14.2     2021-09-27 [1] CRAN (R 4.1.1)
##  DBI                1.1.2      2021-12-20 [1] CRAN (R 4.1.2)
##  dbplyr             2.1.1      2021-04-06 [1] CRAN (R 4.1.2)
##  dendextend         1.15.2     2021-10-28 [1] CRAN (R 4.1.2)
##  digest             0.6.29     2021-12-01 [1] CRAN (R 4.1.2)
##  dplyr            * 1.0.8      2022-02-08 [1] CRAN (R 4.1.2)
##  ellipsis           0.3.2      2021-04-29 [1] CRAN (R 4.1.2)
##  emo                0.0.0.9000 2021-02-26 [1] Github (hadley/emo@3f03b11)
##  evaluate           0.14       2019-05-28 [1] CRAN (R 4.1.2)
##  extrafont          0.17       2014-12-08 [1] CRAN (R 4.1.2)
##  extrafontdb        1.0        2012-06-11 [1] CRAN (R 4.1.2)
##  fansi              1.0.2      2022-01-14 [1] CRAN (R 4.1.2)
##  farver             2.1.0      2021-02-28 [1] CRAN (R 4.1.2)
##  fastmap            1.1.0      2021-01-25 [1] CRAN (R 4.1.2)
##  forcats          * 0.5.1      2021-01-27 [1] CRAN (R 4.1.2)
##  foreach            1.5.2      2022-02-02 [1] CRAN (R 4.1.2)
##  fs                 1.5.2      2021-12-08 [1] CRAN (R 4.1.2)
##  gdtools            0.2.3      2021-01-06 [1] CRAN (R 4.1.2)
##  generics           0.1.2      2022-01-31 [1] CRAN (R 4.1.2)
##  GenomeInfoDb       1.30.1     2022-01-30 [1] Bioconductor
##  GenomeInfoDbData   1.2.7      2022-02-09 [1] Bioconductor
##  GGally             2.1.2      2021-06-21 [1] CRAN (R 4.1.0)
##  ggcats           * 1.0        2021-02-26 [1] Github (R-CoderDotCom/ggcats@a7e5e37)
##  ggcorrplot         0.1.3      2019-05-19 [1] CRAN (R 4.1.2)
##  ggforce            0.3.3      2021-03-05 [1] CRAN (R 4.1.2)
##  ggplot2          * 3.3.5      2021-06-25 [1] CRAN (R 4.1.0)
##  ggpubr             0.4.0      2020-06-27 [1] CRAN (R 4.1.2)
##  ggrepel            0.9.1      2021-01-15 [1] CRAN (R 4.1.2)
##  ggsci              2.9        2018-05-14 [1] CRAN (R 4.1.2)
##  ggsignif           0.6.3      2021-09-09 [1] CRAN (R 4.1.1)
##  glue             * 1.6.1      2022-01-22 [1] CRAN (R 4.1.2)
##  gridExtra          2.3        2017-09-09 [1] CRAN (R 4.1.2)
##  gtable             0.3.0      2019-03-25 [1] CRAN (R 4.1.2)
##  haven              2.4.3      2021-08-04 [1] CRAN (R 4.1.0)
##  heatmaply          1.3.0      2021-10-09 [1] CRAN (R 4.1.2)
##  here             * 1.0.1      2020-12-13 [1] CRAN (R 4.1.2)
##  highr              0.9        2021-04-16 [1] CRAN (R 4.1.2)
##  hms                1.1.1      2021-09-26 [1] CRAN (R 4.1.1)
##  hrbrthemes         0.8.0      2020-03-06 [1] CRAN (R 4.1.2)
##  htmltools          0.5.2      2021-08-25 [1] CRAN (R 4.1.1)
##  htmlwidgets        1.5.4      2021-09-08 [1] CRAN (R 4.1.1)
##  httr               1.4.2      2020-07-20 [1] CRAN (R 4.1.2)
##  IRanges            2.28.0     2021-10-26 [1] Bioconductor
##  iterators          1.0.14     2022-02-05 [1] CRAN (R 4.1.2)
##  janitor            2.1.0      2021-01-05 [1] CRAN (R 4.1.2)
##  jquerylib          0.1.4      2021-04-26 [1] CRAN (R 4.1.2)
##  jsonlite           1.7.3      2022-01-17 [1] CRAN (R 4.1.2)
##  KEGGREST           1.34.0     2021-10-26 [1] Bioconductor
##  klippy           * 0.0.0.9500 2021-02-26 [1] Github (rlesur/klippy@378c247)
##  knitr              1.37       2021-12-16 [1] CRAN (R 4.1.2)
##  labeling           0.4.2      2020-10-20 [1] CRAN (R 4.1.2)
##  lattice            0.20-45    2021-09-22 [1] CRAN (R 4.1.1)
##  lazyeval           0.2.2      2019-03-15 [1] CRAN (R 4.1.2)
##  lifecycle          1.0.1      2021-09-24 [1] CRAN (R 4.1.1)
##  lubridate          1.8.0      2021-10-07 [1] CRAN (R 4.1.2)
##  magick             2.7.3      2021-08-18 [1] CRAN (R 4.1.1)
##  magrittr         * 2.0.2      2022-01-26 [1] CRAN (R 4.1.2)
##  MASS               7.3-55     2022-01-13 [3] CRAN (R 4.1.2)
##  Matrix             1.4-0      2021-12-08 [3] CRAN (R 4.1.2)
##  memoise            2.0.1      2021-11-26 [1] CRAN (R 4.1.2)
##  mgcv               1.8-38     2021-10-06 [3] CRAN (R 4.1.1)
##  mnormt             2.0.2      2020-09-01 [1] CRAN (R 4.1.2)
##  modelr             0.1.8      2020-05-19 [1] CRAN (R 4.1.2)
##  munsell            0.5.0      2018-06-12 [1] CRAN (R 4.1.2)
##  nlme               3.1-155    2022-01-13 [3] CRAN (R 4.1.2)
##  patchwork        * 1.1.1      2020-12-17 [1] CRAN (R 4.1.2)
##  pillar             1.7.0      2022-02-01 [1] CRAN (R 4.1.2)
##  pkgconfig          2.0.3      2019-09-22 [1] CRAN (R 4.1.2)
##  plotly             4.10.0     2021-10-09 [1] CRAN (R 4.1.2)
##  plyr               1.8.6      2020-03-03 [1] CRAN (R 4.1.2)
##  png                0.1-7      2013-12-03 [1] CRAN (R 4.1.2)
##  polyclip           1.10-0     2019-03-14 [1] CRAN (R 4.1.2)
##  psych              2.1.9      2021-09-22 [1] CRAN (R 4.1.1)
##  purrr            * 0.3.4      2020-04-17 [1] CRAN (R 4.1.2)
##  R6                 2.5.1      2021-08-19 [1] CRAN (R 4.1.1)
##  ragg               1.2.1      2021-12-06 [1] CRAN (R 4.1.2)
##  RColorBrewer       1.1-2      2014-12-07 [1] CRAN (R 4.1.2)
##  Rcpp               1.0.8      2022-01-13 [1] CRAN (R 4.1.2)
##  RCurl              1.98-1.6   2022-02-08 [1] CRAN (R 4.1.2)
##  readr            * 2.1.2      2022-01-30 [1] CRAN (R 4.1.2)
##  readxl             1.3.1      2019-03-13 [1] CRAN (R 4.1.2)
##  registry           0.5-1      2019-03-05 [1] CRAN (R 4.1.2)
##  reprex             2.0.1      2021-08-05 [1] CRAN (R 4.1.0)
##  reshape            0.8.8      2018-10-23 [1] CRAN (R 4.1.2)
##  reshape2           1.4.4      2020-04-09 [1] CRAN (R 4.1.2)
##  rlang              1.0.1      2022-02-03 [1] CRAN (R 4.1.2)
##  rmarkdown          2.11       2021-09-14 [1] CRAN (R 4.1.1)
##  rmdformats         1.0.3      2021-10-06 [1] CRAN (R 4.1.2)
##  rprojroot          2.0.2      2020-11-15 [1] CRAN (R 4.1.2)
##  RSQLite            2.2.9      2021-12-06 [1] CRAN (R 4.1.2)
##  rstatix            0.7.0      2021-02-13 [1] CRAN (R 4.1.2)
##  rstudioapi         0.13       2020-11-12 [1] CRAN (R 4.1.2)
##  Rttf2pt1           1.3.10     2022-02-07 [1] CRAN (R 4.1.2)
##  rvest              1.0.2      2021-10-16 [1] CRAN (R 4.1.2)
##  S4Vectors          0.32.3     2021-11-21 [1] Bioconductor
##  sass               0.4.0      2021-05-12 [1] CRAN (R 4.1.0)
##  scales             1.1.1      2020-05-11 [1] CRAN (R 4.1.2)
##  see                0.6.8      2021-10-03 [1] CRAN (R 4.1.2)
##  seriation          1.3.1      2021-10-16 [1] CRAN (R 4.1.2)
##  sessioninfo        1.2.2      2021-12-06 [1] CRAN (R 4.1.2)
##  snakecase          0.11.0     2019-05-25 [1] CRAN (R 4.1.2)
##  stringi            1.7.6      2021-11-29 [1] CRAN (R 4.1.2)
##  stringr          * 1.4.0      2019-02-10 [1] CRAN (R 4.1.2)
##  systemfonts        1.0.3      2021-10-13 [1] CRAN (R 4.1.2)
##  textshaping        0.3.6      2021-10-13 [1] CRAN (R 4.1.2)
##  tibble           * 3.1.6      2021-11-07 [1] CRAN (R 4.1.2)
##  tidyr            * 1.2.0      2022-02-01 [1] CRAN (R 4.1.2)
##  tidyselect         1.1.1      2021-04-30 [1] CRAN (R 4.1.2)
##  tidyverse        * 1.3.1      2021-04-15 [1] CRAN (R 4.1.2)
##  tmvnsim            1.0-2      2016-12-15 [1] CRAN (R 4.1.2)
##  TSP                1.1-11     2021-10-06 [1] CRAN (R 4.1.2)
##  tweenr             1.0.2      2021-03-23 [1] CRAN (R 4.1.2)
##  tzdb               0.2.0      2021-10-27 [1] CRAN (R 4.1.2)
##  utf8               1.2.2      2021-07-24 [1] CRAN (R 4.1.0)
##  vctrs              0.3.8      2021-04-29 [1] CRAN (R 4.1.2)
##  viridis            0.6.2      2021-10-13 [1] CRAN (R 4.1.2)
##  viridisLite        0.4.0      2021-04-13 [1] CRAN (R 4.1.2)
##  webshot            0.5.2      2019-11-22 [1] CRAN (R 4.1.2)
##  withr              2.4.3      2021-11-30 [1] CRAN (R 4.1.2)
##  xfun               0.29       2021-12-14 [1] CRAN (R 4.1.2)
##  xml2               1.3.3      2021-11-30 [1] CRAN (R 4.1.2)
##  XVector            0.34.0     2021-10-26 [1] Bioconductor
##  yaml               2.2.2      2022-01-25 [1] CRAN (R 4.1.2)
##  zlibbioc           1.40.0     2021-10-26 [1] Bioconductor
## 
##  [1] /usr/local/lib/R/site-library
##  [2] /usr/lib/R/site-library
##  [3] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────


HTML rendering

This chunk is not evaluated (eval = FALSE). Otherwise you would probably end up in recursive hell 🤯

library(conflicted)
library(knitr)
library(glue)
library(here)

if (!require(rmdformats)) {
  BiocManager::install("rmdformats")
}

# You can set global chunk options. Options set in individual chunks will override this.
opts_chunk$set(warning = FALSE, message = FALSE, eval = TRUE)
rmarkdown::render(
  here("E02-intro_to_R/intro_to_R.Rmd"),
  output_file = here("E02-intro_to_R/intro_to_R.html"),
  envir = new.env(),
  knit_root_dir = here()
)